Abstract. We consider a simplified version of a solvable model by Mandal and 
Jarzynski, which constructively demonstrates the interplay between work extraction 
and the increase of the Shannon entropy of an information reservoir which is in contact 
with a physical system. We extend Mandal and Jarzynski’s main findings in several 
directions: First, we allow sequences of correlated bits rather than just independent 
bits. Secondly, at least for the case of binary information, we show that, in fact, 
the Shannon entropy is only one measure of complexity of the information that must 
increase in order for work to be extracted. The extracted work can also be upper 
bounded in terms of the increase in other quantities that measure complexity, like the 
predictability of future bits from past ones. Third, we provide an extension to the 
case of non-binary information (i.e., a larger alphabet), and finally, we extend the 
scope to the case where the incoming bits (before the interaction) form an individual 
sequence, rather than a random one. In this case, the entropy before the interaction 
can be replaced by the Lempel-Ziv (LZ) complexity of the incoming sequence, a fact 
that gives rise to an entropic meaning of the LZ complexity, not only in information 
theory, but also in physics. 

Keywords: information exchange, second law, entropy, complexity. 


Sequence Complexity and Work Extraction 

1. Introduction 




Information processing and the role that it plays in thermodynamics is a very well- 
known concept that dates back to the second half of the nineteenth century, namely, 
to James Clerk Maxwell and his famous gedanken experiment, known as Maxwell’s 
demon na. The Maxwell demon experiment shows that an intelligent agent, with 
access to measurements of velocities and positions of particles in a gas, is able to separate 
speedy particles from the slower ones, thereby creating a temperature difference without 
injecting energy into the system, which is seemingly in conflict with the second law of 
thermodynamics. Several decades later, Leo Szilard [19] continued this line of thought, 
and demonstrated the conversion of heat into work, using a model of a box that contains 
a single particle. He showed that by measurement and control, one may be able to 
extract work in a closed cycle of the system, which is, again, in apparent contradiction 
with the second law. 

This suspected violation of the second law has triggered a long-lasting controversy 
and many other thought-provoking gedanken experiments that have eventually 
furnished the basis for a rather large volume of theoretical work concerning the role 
and the implications of information processing in thermodynamics. A non-exhaustive 
list of recent works on the modern approach of incorporating informational ingredients 
in physical systems includes [1], [2], [3], [5], [6], [H], [9], [10], [11], [H], [IS], [IE], [20], [21], 
and [22]. In some of these works, the informational resources are available by means of 
measurement and feedback control (as in Maxwell’s demon and Szilard’s engine), 
and other works are about physical systems that include, in addition to the traditional 
heat reservoir, also an information reservoir, which interacts with the system, but 
without any energy exchange. The main common motif in these works is in extended 
versions of the second law, where the expression of the entropy increase includes an extra 
entropic term that is associated with the information exchange. These extended versions 
of the second law are, of course, intimately related to Landauer’s erasure principle [12]. 

Unlike earlier proposed thought experiments, which were mostly described in generic 
terms and were not fully specified, Mandal and Jarzynski [14] were the first to propose 
an explicit solvable model of a concrete system that behaves in the spirit of the Maxwell 
demon. Specifically, they described and analyzed a relatively simple autonomous system 
(based on a six-state Markov jump process) that, when working as an engine, converts 
thermal fluctuations (heat) into mechanical work, while writing digital information onto 
a running tape (in the role of an information reservoir), thereby increasing its Shannon 
entropy. It may also act as an eraser, which implements the opposite process of losing 
energy while erasing information, that is, decreasing the entropy. Several variations on 
this model, based on similar ideas, were offered in some subsequent works, e.g., [1], [2], 
[3], and [15]. 

In this paper, we consider a simplified version† of Mandal and Jarzynski’s model 
[14] and we focus on extensions of their findings in several directions. 

† Instead of the six-state Markov process of [14], we use a two-state process, which is easier to analyze. 
(i) Allowing sequences of correlated bits rather than just independent bits. 

(ii) At least for the case of binary information, it is shown that, in fact, the Shannon 
entropy is only one measure of complexity of the information that must increase in 
order for work to be extracted. The extracted work can also be upper bounded 
in terms of the increase in other quantities that measure complexity, like the 
predictability of future bits from past ones. 

(iii) An extension is offered for the case of non-binary information (i.e., digital 
information with a larger alphabet). 

(iv) Extension of the scope to the case where the incoming bits (before the interaction) 
form an individual sequence, namely, a deterministic sequence rather than a random 
one. 

In the last item above, instead of the term of information entropy before the interaction, 
we have the Lempel-Ziv (LZ) complexity [23] of the incoming sequence, a fact that gives 
rise to an entropic meaning of the LZ complexity, not only in information theory, but 
also in physics. 

We believe that similar extensions can also be offered for the other variations of 
this model that appear in [1], [2], [3], and [15], as mentioned. 

2. Notation Conventions 

Throughout the paper, random variables will be denoted by capital letters, specific 
values they may take will be denoted by the corresponding lower case letters, and their 
alphabets will be denoted by calligraphic letters. Random vectors, their realizations 
and their alphabets will be denoted, respectively, by capital letters, the corresponding 
lower case letters, and the corresponding calligraphic letters, all superscripted by their 
dimension. For example, the random vector $X^n = (X_1, \ldots, X_n)$ ($n$ a positive integer) 
may take a specific vector value $x^n = (x_1, \ldots, x_n)$ in $\mathcal{X}^n$, which is the $n$-th order 
Cartesian power of $\mathcal{X}$, the alphabet of each component of this vector. The probability 
of an event $\mathcal{E}$ will be denoted by $P[\mathcal{E}]$. The indicator function of an event $\mathcal{E}$ will be 
denoted by $\mathcal{I}[\mathcal{E}]$. 

The Shannon entropy of a discrete random variable $X$ will be denoted§ by $H(X)$, 
that is, 

\[ H(X) = -\sum_{x \in \mathcal{X}} P(x) \ln P(x), \tag{1} \]

where $\{P(x),\ x \in \mathcal{X}\}$ is the probability distribution of $X$. When we wish to emphasize 
the dependence of the entropy on the underlying distribution $P$, we denote it by $\mathcal{H}(P)$. 
The binary entropy function will be defined as 

\[ h(p) = -p \ln p - (1-p) \ln(1-p), \qquad 0 \le p \le 1. \tag{2} \]

§ Following the customary notation conventions in information theory, $H(X)$ should not be understood 
as a function $H$ of the random outcome of $X$, but as a functional of the probability distribution of $X$. 




Similarly, for a discrete random vector $X^n = (X_1, \ldots, X_n)$, the joint entropy is denoted 
by $H(X^n)$ (or by $H(X_1, \ldots, X_n)$), and defined as 

\[ H(X^n) = -\sum_{x^n \in \mathcal{X}^n} P(x^n) \ln P(x^n). \tag{3} \]

The conditional entropy of a generic random variable $U$ over a discrete alphabet $\mathcal{U}$, 
given another generic random variable $V \in \mathcal{V}$, is defined as 

\[ H(U|V) = -\sum_{u \in \mathcal{U}} \sum_{v \in \mathcal{V}} P(u,v) \ln P(u|v), \tag{4} \]

which should not be confused with the conditional entropy given a specific realization 
of $V$, i.e., 

\[ H(U|V=v) = -\sum_{u \in \mathcal{U}} P(u|v) \ln P(u|v). \tag{5} \]

The mutual information between $U$ and $V$ is 

\[ \begin{aligned} I(U;V) &= H(U) - H(U|V) \\ &= H(V) - H(V|U) \\ &= H(U) + H(V) - H(U,V), \end{aligned} \tag{6} \]

where it should be kept in mind that in all three definitions, $U$ and $V$ can themselves 
be random vectors. The Kullback-Leibler divergence (a.k.a. relative entropy or cross-entropy) 
between two distributions $P$ and $Q$ on the same alphabet $\mathcal{X}$ is defined as 

\[ D(P\|Q) = \sum_{x \in \mathcal{X}} P(x) \ln \frac{P(x)}{Q(x)}. \tag{7} \]

3. Setup Description, Preliminaries and Objectives 

Consider the system depicted in Fig. 1, which is a simplified version of the one in 
[14]. 
Figure 1. A system that interacts with a sequence of bits recorded on a running tape. 

A device that consists of a wheel that is loaded (via another wheel with 
transmission) by a mass $m$ interacts with a running tape that bears digital information 
in the form of a series of incoming bits, denoted $X_1, X_2, \ldots$, $X_i \in \{0,1\}$, $i = 1, 2, \ldots$. 
The device also interacts thermally with a heat bath at temperature $T$ (not shown in 
Fig. 1) in the form of heat exchange, but there is no energy exchange with the tape. 
During each time interval of $\tau$ seconds, $i\tau \le t < (i+1)\tau$ ($i$ a positive integer), the 
device interacts with the $i$-th bit, $X_i$, in the following manner: If $X_i = 0$, then the initial 
state of the composite system (device plus bit) is ‘0’ and then, due to random thermal 
fluctuations, the wheel may spontaneously rotate, say, half a cycle counter-clockwise 
(CCW) at a random time, thereby changing the state of the system to ‘1’ and thus 
causing the mass to be lifted by $\Delta$ (which is half the circumference of the bigger wheel 
in Fig. 1). Then, at a later random time, it may rotate clockwise (CW), changing the 
state back to ‘0’, and causing the mass to descend back by $\Delta$, etc. The net change in 
the height of the mass, during this interval, depends, of course, only on the parity of 
the number of state transitions during this interval. At the end of this time interval, 
namely, at time $t = (i+1)\tau - 0$, the current state is recorded on the tape as the outgoing 
bit, denoted by $Y_i$. Note that if $X_i = 0$ and $Y_i = 1$, then the net work done by the device, 
during this time interval, is $\Delta W_i = mg\Delta$; otherwise $\Delta W_i = 0$. Similarly, if the incoming 
bit is $X_i = 1$, then the initial state is ‘1’ and then the first state transition (if any) is 
associated with a CW rotation. By the same reasoning as before, at the end of the 
time interval, if $Y_i = 0$, then the net work done by the device, during this interval, is 
$\Delta W_i = -mg\Delta$; otherwise, it is $\Delta W_i = 0$. Thus, in general, the work done during the 
$i$-th interval is $\Delta W_i = mg\Delta \cdot (Y_i - X_i)$. Next, a new interval begins and it becomes 
the turn of bit $X_{i+1}$ to interact with the device for $\tau$ seconds, and so on. It should be 
emphasized that this transition from the former outgoing bit $Y_i$ to a new incoming bit 
$X_{i+1}$ is not accompanied by any energy exchange between the tape and the system (the 
wheel does not move in response to this transition). This new bit just determines which 
direction of rotation is enabled and which one is disabled. 

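The above described mechanism of back and forth transitions (with their associated 
rotations) within each interval is modelled as a two-state Markov jump process with 
transition rates $\lambda_{0\to 1}$ and $\lambda_{1\to 0}$, related by 

\[ \lambda_{0\to 1} = \lambda_{1\to 0} \cdot e^{-mg\Delta/kT}, \tag{8} \]

giving rise to an equilibrium (Boltzmann) distribution 

\[ P_{\mathrm{eq}}[0] = \frac{1}{1 + e^{-mg\Delta/kT}}, \qquad P_{\mathrm{eq}}[1] = \frac{e^{-mg\Delta/kT}}{1 + e^{-mg\Delta/kT}}, \tag{9} \]

which manifests the fact that state ‘1’ is more energetic than state ‘0’, the energy 
difference being $\Delta E = mg\Delta$. At each interval, the temporal evolution of the probability 
of state ‘1’ is according to the master equation: 

\[ \frac{dP_t[1]}{dt} = \lambda_{0\to 1} - \lambda P_t[1], \tag{10} \]

where $\lambda = \lambda_{0\to 1} + \lambda_{1\to 0}$. This simple first order differential equation is readily solved by 

\[ \begin{aligned} P_t[1] &= \frac{\lambda_{0\to 1}}{\lambda} + \left(P_0[1] - \frac{\lambda_{0\to 1}}{\lambda}\right) \cdot e^{-\lambda t} \\ &= P_{\mathrm{eq}}[1] + (P_0[1] - P_{\mathrm{eq}}[1]) \cdot e^{-\lambda t}, \end{aligned} \tag{11} \]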









and, of course, $P_t[0]$ complements to unity. It is therefore readily seen that the 
mechanism that transforms the sequence of incoming bits, $X_1, X_2, \ldots$, into a sequence of 
outgoing bits, $Y_1, Y_2, \ldots$, is simply a binary-input, binary-output discrete memoryless 
channel‖ (DMC) $Q = \{Q_{x\to y},\ x, y \in \{0,1\}\}$, whose transition probabilities are given by 

\[ Q_{0\to 0} = 1 - Q_{0\to 1} = P_{\mathrm{eq}}[0] + P_{\mathrm{eq}}[1] \cdot e^{-\lambda\tau}, \tag{12} \]

\[ Q_{1\to 1} = 1 - Q_{1\to 0} = P_{\mathrm{eq}}[1] + P_{\mathrm{eq}}[0] \cdot e^{-\lambda\tau}. \tag{13} \]
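As a numerical illustration of eqs. (9), (12) and (13), the following minimal Python sketch builds the channel matrix from the two dimensionless parameters $f = mg\Delta/kT$ and $\lambda\tau$. The function names `equilibrium` and `channel`, and the chosen parameter values, are illustrative assumptions rather than part of the model description.

```python
import math

def equilibrium(f):
    """Return (P_eq[0], P_eq[1]) of eq. (9), with f = m*g*Delta/(k*T)."""
    p1 = math.exp(-f) / (1.0 + math.exp(-f))
    return 1.0 - p1, p1

def channel(f, lam_tau):
    """Return Q[x][y], the bit channel of eqs. (12)-(13); lam_tau = lambda*tau."""
    p0, p1 = equilibrium(f)
    decay = math.exp(-lam_tau)
    q00 = p0 + p1 * decay          # eq. (12)
    q11 = p1 + p0 * decay          # eq. (13)
    return [[q00, 1.0 - q00], [1.0 - q11, q11]]

# Illustrative (assumed) parameter values.
Q = channel(f=1.0, lam_tau=0.5)
```

For $\lambda\tau = 0$ the channel is noiseless (no transitions occur), and for $\lambda\tau \to \infty$ both rows approach the equilibrium distribution, so the outgoing bit forgets the incoming one.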

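The expected work done by the device after $n$ cycles is given by 

\[ \begin{aligned} \langle W_n \rangle &= mg\Delta \cdot \left\langle \sum_{i=1}^n (Y_i - X_i) \right\rangle \\ &= mg\Delta \cdot \sum_{i=1}^n (P[Y_i = 1] - P[X_i = 1]) \\ &= kTf \cdot \sum_{i=1}^n (P[Y_i = 1] - P[X_i = 1]), \end{aligned} \tag{14} \]

where $f = mg\Delta/kT$. Now, from the above derived time evolution of the state 
distribution within an interval of duration $\tau$, one easily finds that 

\[ P[Y_i = 1] = P_{\mathrm{eq}}[1] + (P[X_i = 1] - P_{\mathrm{eq}}[1]) \cdot e^{-\lambda\tau}, \tag{15} \]

which means a monotonic change, starting from $P[X_i = 1]$ and ending at $P_{\mathrm{eq}}[1]$. In 
other words, $P[Y_i = 1]$ is always between $P[X_i = 1]$ and $P_{\mathrm{eq}}[1]$. 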

We next focus on the informational (Shannon) entropy production, namely, the 
difference between the entropy of the outgoing bit-stream $\{Y_i\}$ and the entropy of the 
incoming bit-stream $\{X_i\}$. By the concavity of the binary entropy function, $h(\cdot)$, it is easily 
seen that for every $s, t \in [0,1]$: 

\[ h(s) \le h(t) + (s - t) \cdot h'(t) = h(t) + (s - t) \ln\frac{1-t}{t}. \tag{16} \]

Thus, setting $s = P[X_i = 1]$ and $t = P[Y_i = 1]$, we get 

\[ H(X_i) = h(P[X_i = 1]) \le h(P[Y_i = 1]) + (P[X_i = 1] - P[Y_i = 1]) \ln\frac{1 - P[Y_i = 1]}{P[Y_i = 1]}, \tag{17} \]

or equivalently, 

\[ (P[Y_i = 1] - P[X_i = 1]) \ln\frac{1 - P[Y_i = 1]}{P[Y_i = 1]} \le H(Y_i) - H(X_i). \tag{18} \]

Now, if $P[Y_i = 1] \ge P[X_i = 1]$, then $P_{\mathrm{eq}}[1] \ge P[Y_i = 1] \ge P[X_i = 1]$, and then 

\[ \begin{aligned} (P[Y_i = 1] - P[X_i = 1]) \cdot f &= (P[Y_i = 1] - P[X_i = 1]) \ln\frac{1 - P_{\mathrm{eq}}[1]}{P_{\mathrm{eq}}[1]} \\ &\le (P[Y_i = 1] - P[X_i = 1]) \ln\frac{1 - P[Y_i = 1]}{P[Y_i = 1]} \\ &\le H(Y_i) - H(X_i). \end{aligned} \tag{19} \]


‖ A memoryless channel is characterized by the assumption that the conditional probability of $y^n$ given 
$x^n$ is given by the product of conditional probabilities of $y_i$ given $x_i$, $i = 1, 2, \ldots, n$. 
Similarly, if $P[Y_i = 1] < P[X_i = 1]$, then $P_{\mathrm{eq}}[1] \le P[Y_i = 1] < P[X_i = 1]$, and then 
again, 

\[ (P[Y_i = 1] - P[X_i = 1]) \cdot f \le H(Y_i) - H(X_i), \tag{20} \]

since the terms $f$ and $\ln\{(1 - P[Y_i = 1])/P[Y_i = 1]\}$ are multiplied by $(P[Y_i = 1] - P[X_i = 1])$, 
which is now non-positive. Thus, in both cases, the last inequality holds, and so, 
as is actually shown in [14], 

\[ \langle \Delta W_i \rangle = kTf \cdot (P[Y_i = 1] - P[X_i = 1]) \le kT[H(Y_i) - H(X_i)]. \tag{21} \]

Summing (21) from $i = 1$ to $n$ gives 

\[ \langle W_n \rangle = kTf \cdot \sum_{i=1}^n (P[Y_i = 1] - P[X_i = 1]) \le kT \sum_{i=1}^n [H(Y_i) - H(X_i)], \tag{22} \]

where the left-hand side (l.h.s.) is the total average work after $n$ cycles. The exact 
total average work is given by 

\[ \begin{aligned} \langle W_n \rangle &= kTf \cdot (1 - e^{-\lambda\tau}) \left( nP_{\mathrm{eq}}[1] - \sum_{i=1}^n P[X_i = 1] \right) \\ &= kTf \cdot (1 - e^{-\lambda\tau}) \left( \sum_{i=1}^n P[X_i = 0] - nP_{\mathrm{eq}}[0] \right), \end{aligned} \tag{23} \]

which is obviously positive if and only if $\frac{1}{n}\sum_{i=1}^n P[X_i = 0] > P_{\mathrm{eq}}[0]$. If $\{X_i\}$ are i.i.d. 
(Bernoulli), as assumed in [14] (as well as in subsequent follow-up papers mentioned 
earlier), then so are $\{Y_i\}$, and the right-hand side (r.h.s.) of (22) agrees with the total 
informational entropy production, $kT\Delta H = kT[H(Y^n) - H(X^n)]$. 
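The single-cycle bound (21) can be checked numerically. The sketch below evaluates both sides, in units of $kT$, using eq. (15) for the output marginal; the function names `binary_entropy` and `work_and_bound` and the test values are illustrative assumptions.

```python
import math

def binary_entropy(p):
    """Binary entropy h(p) in nats, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def work_and_bound(p_x1, f, lam_tau):
    """Return (<Delta W_i>/kT, H(Y_i) - H(X_i)) for an input bit with P[X_i = 1] = p_x1."""
    p_eq1 = math.exp(-f) / (1.0 + math.exp(-f))
    p_y1 = p_eq1 + (p_x1 - p_eq1) * math.exp(-lam_tau)   # eq. (15)
    work = f * (p_y1 - p_x1)                             # l.h.s. of eq. (21), divided by kT
    bound = binary_entropy(p_y1) - binary_entropy(p_x1)  # r.h.s. of eq. (21), divided by kT
    return work, bound

# An all-zero input tape (P[X_i = 1] = 0) yields positive work below the entropy gain.
work, bound = work_and_bound(p_x1=0.0, f=1.0, lam_tau=0.5)
```

Note that the work is negative whenever $P[X_i = 1] > P_{\mathrm{eq}}[1]$, in agreement with the sign condition that follows eq. (23).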

As discussed in [13], the inequality is saturated (in the sense that the ratio 
$f \cdot (P[Y_i = 1] - P[X_i = 1])/[H(Y_i) - H(X_i)]$ tends to unity) when $P[Y_i = 1]$ is very close 
to $P[X_i = 1]$ (which happens if either $\lambda\tau \ll 1$ or if $P[X_i = 1]$ is very close to $P_{\mathrm{eq}}[1]$, 
to begin with), but then the amount of work accumulated is very small. To approach 
the entropy difference limit when this difference is appreciably large, one may iterate 
in small steps, namely, work with $\lambda\tau \ll 1$ and feed $\{Y_i\}$ as an incoming bit-stream to 
another (identical, but independent) copy of the same device to generate yet another 
bit-stream $\{Z_i\}$ with a further increased entropy, etc. Alternatively, one may feed $\{Y_i\}$ 
back to the same system. This way, with many repetitions of this process, the total 
work would be very close to $kT$ times the overall growth of the Shannon entropy. This 
idea is in the spirit of quasi-static reversible processes in thermodynamics and statistical 
mechanics. 

As explained in the Introduction, we extend these results in several directions: 

(i) Allowing the incoming bits, $X_1, X_2, \ldots, X_n$, to be correlated rather than just 
independent, identically distributed (i.i.d.) bits. In this case, the sum of entropy 
differences, $\sum_i [H(Y_i) - H(X_i)]$, at the r.h.s. of (22) is different, in general, from the 
correct expression of the increase in the total Shannon entropy, $H(Y^n) - H(X^n)$, 
which in turn takes the correlations among the bits into account. It will be shown, 
nevertheless, that the correct expression associated with the entropy increase, 
$kT[H(Y^n) - H(X^n)]$, is still an upper bound on the average work. This holds 
true for an arbitrary joint distribution of $(X_1, X_2, \ldots, X_n)$. 

(ii) At least for the case of binary information, it will be shown that an inequality like 
(21) (even in its vector form) may hold even if the Shannon entropies on the r.h.s. 
are replaced by generalized entropies, which may serve as alternative measures of 
information complexity, such as the average probability of error in predicting the 
next bit $X_{i+1}$ from the bits seen thus far, $X_1, X_2, \ldots, X_i$, $i = 1, 2, \ldots, n$. 

(iii) We provide an extension of the above to the case of non-binary information, i.e., 
$\{X_i\}$ and $\{Y_i\}$ take on values in a general finite alphabet, whose size may be larger 
than 2. Under the general alphabet size setting, however, item (ii) above is no 
longer claimed. 

(iv) We extend the scope to the case where the incoming bits $x_1, x_2, \ldots, x_n$ form an 
individual sequence, namely, a deterministic sequence rather than a random one. 
In this case, in the r.h.s. of (22), the analogue of the probabilistic input entropy 
$H(X^n)$ will be (for large $n$) the Lempel-Ziv (LZ) complexity of the given sequence 
$x_1, x_2, \ldots, x_n$. As for the output entropy ($Y^n$ is still a random vector), we will 
provide computable bounds. 

4. Correlated Input Bits 

Consider the case where the binary random vector $(X_1, \ldots, X_n)$, of the first $n$ input bits, 
has a general joint distribution. As said, in this case, the r.h.s. of eq. (22) is no longer 
associated with the correct overall change in the Shannon entropy, $H(Y^n) - H(X^n)$. 
Nonetheless, our purpose, in this section, is to show that the latter expression (times 
$kT$) continues to be an upper bound on the expected work. 

We proceed as follows. Using the fact that the channel $Q$ connecting $X^n$ and $Y^n$ is a 
DMC: 

\[ \begin{aligned} H(Y^n) - H(X^n) &= \sum_{i=1}^n [H(Y_i|Y^{i-1}) - H(X_i|X^{i-1})] \\ &\ge \sum_{i=1}^n [H(Y_i|X^{i-1}, Y^{i-1}) - H(X_i|X^{i-1})] \\ &= \sum_{i=1}^n [H(Y_i|X^{i-1}) - H(X_i|X^{i-1})] \\ &= \sum_{i=1}^n \sum_{x^{i-1}} P(x^{i-1}) [H(Y_i|X^{i-1} = x^{i-1}) - H(X_i|X^{i-1} = x^{i-1})] \\ &= \sum_{i=1}^n \sum_{x^{i-1}} P(x^{i-1}) \{h(P[Y_i = 1|X^{i-1} = x^{i-1}]) - h(P[X_i = 1|X^{i-1} = x^{i-1}])\} \\ &\ge f \cdot \sum_{i=1}^n \sum_{x^{i-1}} P(x^{i-1}) (P[Y_i = 1|X^{i-1} = x^{i-1}] - P[X_i = 1|X^{i-1} = x^{i-1}]) \\ &= f \cdot \sum_{i=1}^n (P[Y_i = 1] - P[X_i = 1]) \\ &= \frac{\langle W_n \rangle}{kT}, \end{aligned} \tag{24} \]

where the third line is due to the fact that $Y_i$ is statistically independent of $Y^{i-1}$ given 
$X^{i-1}$, and the second inequality is again due to the concavity of $h(\cdot)$. 


Discussion. We have two upper bounds on the total work, $kT\sum_{i=1}^n [H(Y_i) - H(X_i)]$ 
and $kT[H(Y^n) - H(X^n)]$. As an upper bound, the former is always tighter; in other 
words, we argue (see Appendix A for the proof) that 

\[ H(Y^n) - H(X^n) \ge \sum_{i=1}^n [H(Y_i) - H(X_i)], \tag{25} \]

and so for the purpose of bounding the expected work, there is no point in looking 
at higher order entropies of the incoming and outgoing processes. However, from the 
physical point of view, the inequality $\langle W_n \rangle \le kT[H(Y^n) - H(X^n)]$ remains meaningful 
since the difference $k[H(Y^n) - H(X^n)] - \langle W_n \rangle/T$ has the natural meaning of the 
total entropy production (of the combined system and its environment) for the more 
general case considered, i.e., where $\{X_i\}$ may be correlated. The non-negativity of 
this difference is then a version of the (generalized) second law of thermodynamics for 
systems that include information reservoirs. It follows from this discussion that if one 
has any control on the incoming bit sequence, then introducing correlations among the bits 
is counter-productive in the sense that it only enlarges the entropy production without 
enlarging the extracted work (for a given marginal probability assignment). In other 
words, among all input vectors with a given average marginal, $P[x] = \frac{1}{n}\sum_{i=1}^n P[X_i = x]$, 
the best one is an i.i.d. process (i.e., a Bernoulli process) with a single-bit marginal given 
by $P[X_i = x] = P[x]$ for all $i$. In any other case, there is an extra entropy production 
due to input correlations. 
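The ordering of the two bounds can be illustrated by brute-force enumeration for a short block. The sketch below computes $\langle W_n \rangle/kT$, the sum of single-letter entropy differences, and the joint entropy difference $H(Y^n) - H(X^n)$. The input law is taken, purely as an assumption for illustration, to be a first-order binary Markov chain with a hypothetical "stay" probability; the helper names are likewise illustrative.

```python
import itertools
import math

def entropy(dist):
    """Shannon entropy (nats) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)

def analyse(n, stay, p_first1, f, lam_tau):
    """Return (<W_n>/kT, sum_i [H(Y_i) - H(X_i)], H(Y^n) - H(X^n))."""
    p_eq1 = math.exp(-f) / (1.0 + math.exp(-f))
    d = math.exp(-lam_tau)
    # Channel of eqs. (12)-(13): Q[x][y].
    Q = [[(1 - p_eq1) + p_eq1 * d, p_eq1 * (1 - d)],
         [(1 - p_eq1) * (1 - d), p_eq1 + (1 - p_eq1) * d]]
    # Assumed input law: Markov chain that repeats the previous bit w.p. `stay`.
    px = {}
    for x in itertools.product((0, 1), repeat=n):
        p = p_first1 if x[0] == 1 else 1.0 - p_first1
        for a, b in zip(x, x[1:]):
            p *= stay if a == b else 1.0 - stay
        px[x] = p
    # Output law induced by n independent uses of the channel.
    py = {y: sum(p * math.prod(Q[a][b] for a, b in zip(x, y)) for x, p in px.items())
          for y in itertools.product((0, 1), repeat=n)}

    def marginals(dist):
        return [sum(p for v, p in dist.items() if v[i] == 1) for i in range(n)]

    def hb(p):
        return entropy({0: 1.0 - p, 1: p})

    mx, my = marginals(px), marginals(py)
    work = f * sum(b - a for a, b in zip(mx, my))
    single_letter = sum(hb(b) - hb(a) for a, b in zip(mx, my))
    joint = entropy(py) - entropy(px)
    return work, single_letter, joint

work, single_letter, joint = analyse(n=4, stay=0.9, p_first1=0.1, f=1.0, lam_tau=0.7)
```

With strongly correlated inputs (`stay` close to 1), the gap between the joint and the single-letter bounds grows, which is the extra entropy production attributed above to input correlations.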

Note that if $X^n$ is a codeword from a rate-$R$ channel block code (with equiprobable 
messages) for reliable communication across the channel $Q$, namely, $H(X^n) = nR$ and 
$H(X^n|Y^n)$ is small (by Fano’s inequality [U Section 2.10]), then 

\[ H(Y^n) - H(X^n) \approx H(Y^n|X^n) = \sum_{i=1}^n H(Y_i|X_i) = n[P[0]h(Q_{0\to 0}) + P[1]h(Q_{1\to 1})]. \tag{26} \]

In this case, as $H(Y^n) \approx n[R + P[0]h(Q_{0\to 0}) + P[1]h(Q_{1\to 1})]$, one can reliably recover 
from $Y^n$ both the incoming process $X^n$ and the entire history of (net) movements of 
the wheel across the various intervals, so no information is lost. 
Note that the only properties of the entropy function that were used in Section 3 were: 
(i) concavity, and (ii) $h'(P_{\mathrm{eq}}[1]) = f$. The second property does not pose any serious 
limitation because any concave function can either be scaled or added with a linear 
term (both without harming the concavity property), so that (ii) would hold. It follows 
then that the Shannon entropy is not the only measure that describes the increased 
complexity of information that must accompany the extracted work. In other words, 
there are additional measures for the amount of extra randomness or the “amount of 
information” that must be written in order to make the system convert heat to work. 

We describe a generalized entropy function that is based on a function $L_x(w)$, which 
is an arbitrary function of $x \in \{0,1\}$ and a variable $w \in \mathcal{W}$, that can be thought of 
as a ‘loss’ associated with the choice of $w$ when the observation is $x$. We then define a 
generalized entropy function as the minimum achievable average loss associated with a 
binary random variable $X$, with $P[X = 1] = 1 - P[X = 0] = p$, that is 

\[ h(p) = \min_{w \in \mathcal{W}} [(1 - p) \cdot L_0(w) + p \cdot L_1(w)]. \tag{27} \]

Indeed, the binary Shannon entropy $h(p)$ is obtained as a special case for $L_0(w) = 
-\ln(1 - w)$ and $L_1(w) = -\ln w$, $\mathcal{W} = [0,1]$, as the minimum is attained for $w^* = p$. 
Since $h(p)$ is the minimum of affine functions of $p$, it is clearly concave. Two additional 
examples of entropy-like functions are the following: 


(i) Let $L_x(w) = \mathcal{I}[w \ne x]$, $\mathcal{W} = \{0,1\}$, measure the loss in (possibly erroneous) 
‘guessing’ of $x$ by $w$. In this case, $h(p) = \min\{p, 1 - p\}$. 

(ii) The squared-error loss function, $L_x(w) = (x - w)^2$, $\mathcal{W} = [0,1]$, yields $h(p) = 
p(1 - p)$. 

The extension of (21) now asserts that the average work extraction $\langle \Delta W_i \rangle$, within a 
single cycle, cannot exceed 

\[ \frac{mg\Delta}{h'(P_{\mathrm{eq}}[1])} \cdot \Delta h = \frac{kTf}{h'(P_{\mathrm{eq}}[1])} \cdot \Delta h, \tag{28} \]

where $\Delta h = h(P[Y_i = 1]) - h(P[X_i = 1])$ is the increase in the (generalized) ‘complexity’ 
in $Y_i$ relative to $X_i$, and where we have assumed that $h(\cdot)$ is differentiable at $p = P_{\mathrm{eq}}[1]$. 
We will comment on the non-differentiable case shortly. 

Denoting $H(X_i) = h(P[X_i = 1])$ and $H(Y_i) = h(P[Y_i = 1])$, we can generalize 
the above discussion (including (24), provided that the first equality is considered a 
definition) to correlated sequences of bits, by introducing the definition 

\[ H(X_i|X^{i-1}) = \sum_{x^{i-1}} P(x^{i-1}) h(P[X_i = 1|X^{i-1} = x^{i-1}]) \tag{29} \]

and similar definitions for the other generalized conditional entropies. Considering the 
first example above, $H(X_i|X^{i-1})$ designates the predictability [7] of $X_i$ given $X^{i-1}$, i.e., 
the minimum achievable probability of error in guessing $X_i$ from $X^{i-1}$, which is certainly 
a reasonable measure of complexity. As for the second example above, $H(X_i|X^{i-1})$ has 
the meaning of the minimum mean squared error in estimating $X_i$ based on $X^{i-1}$. Here, 
$h'(P_{\mathrm{eq}}[1]) = \tanh(f/2)$. Thus, the factor $kTf/h'(P_{\mathrm{eq}}[1]) = kTf/\tanh(f/2)$, which is 
about $kTf = mg\Delta$ at very low temperatures, and about $2kT$ at very high temperatures. 

On a technical note, observe that in general $h(\cdot)$ may not be differentiable at 
$P_{\mathrm{eq}}[1]$, but due to the concavity, there are always one-sided derivatives $h'_+(P_{\mathrm{eq}}[1]) = 
\lim_{\delta \downarrow 0} [h(P_{\mathrm{eq}}[1] + \delta) - h(P_{\mathrm{eq}}[1])]/\delta$ and $h'_-(P_{\mathrm{eq}}[1]) = \lim_{\delta \uparrow 0} [h(P_{\mathrm{eq}}[1] + \delta) - h(P_{\mathrm{eq}}[1])]/\delta$, 
with $h'_-(P_{\mathrm{eq}}[1]) \ge h'_+(P_{\mathrm{eq}}[1])$. We can always use either one. In case of a strict 
inequality, we can choose the one that gives the tighter inequality, namely, $h'_-(P_{\mathrm{eq}}[1])$ if 
$\sum_i P[Y_i = 1] > \sum_i P[X_i = 1]$ and $h'_+(P_{\mathrm{eq}}[1])$ otherwise. 

Another class of generalized entropies obeys the form $H(X) = \langle S[1/P(X)] \rangle$, where 
$S$ is an arbitrary concave function (e.g., $S[u] = \ln u$ gives the Shannon entropy), which is 
easily seen to be a concave functional of $P$. In the binary case considered here, this would 
amount to $h(p) = pS[1/p] + (1 - p)S[1/(1 - p)]$. The concavity property guarantees 
that our earlier arguments hold for this kind of generalized entropy as well. Similar 
comments apply to yet another class of generalized entropies, $H(X) = \sum_x S[P(x)]$, 
where $S$ is again concave (e.g., $S[u] = -u \ln u$ gives the Shannon entropy). 

This discussion sets the stage for a richer family of bounds on the extracted 
work, which depend on various notions of sequence complexity. Provided that $h(\cdot)$ is 
differentiable at $P_{\mathrm{eq}}[1]$, these bounds are asymptotically met in the limit of infinitesimally 
small differences between $P[Y_i = 1]$ and $P[X_i = 1]$, as discussed above in the context of 
the ordinary entropy. Nonetheless, among all generalized entropies we have discussed, 
only the Shannon entropy is known to be invariant under permutations, e.g., for 
$n = 2$, the Shannon entropy satisfies $H(X_1) + H(X_2|X_1) = H(X_2) + H(X_1|X_2)$, but for the 
generalized conditional entropies defined as in (29), the same identity is not true in general. 
Also, it is not clear if and how any of the 
other entropy-like functionals continue to serve in bounding the average work when the 
setup is extended to larger alphabets (see Section 6 below). These two points give rise 
to the special stature of the ordinary Shannon entropy, which prevails in a deeper sense 
and in more general situations. 

6. Non-Binary Sequences 

In this section, we extend the results from the case where $\{X_i\}$ and $\{Y_i\}$ take on binary 
values to the case of a general finite alphabet of size $K$. Correspondingly, in this case, 
the underlying system dynamics would be associated with some Markov jump process 
having $K$ states. Associated with each state $s$, there would be a corresponding height 
increment, $\Delta(s)$ (which may be positive or negative), relative to some reference state, 
say $s_0$, with $\Delta(s_0) = 0$. Each state $s$ is then associated with energy, $E(s) = mg\Delta(s)$, and 
the transition rates of the underlying Markov process obey detailed balance accordingly. 
For the given interaction interval of length $\tau$, let $P_t$ denote the $K$-dimensional vector of 
state probabilities at time $t$, so that the initial distribution vector $P_0$ corresponds to the 
probability distribution of the incoming symbol, the final distribution $P_\tau$ corresponds 
to that of the outgoing symbol, and $P_\infty = P_{\mathrm{eq}}$. The Markovity of the process implies 
that $D(P_t\|P_{\mathrm{eq}})$ is monotonically non-increasing¶ in $t$ (see, e.g., [1], [IS]), and so, 

\[ D(P_\tau\|P_{\mathrm{eq}}) \le D(P_0\|P_{\mathrm{eq}}), \tag{30} \]

which, after straightforward algebraic manipulation, becomes 

\[ \sum_s (P_\tau[s] - P_0[s]) \ln\frac{1}{P_{\mathrm{eq}}[s]} \le \mathcal{H}(P_\tau) - \mathcal{H}(P_0). \tag{31} \]

Since 

\[ \ln\frac{1}{P_{\mathrm{eq}}[s]} = \ln Z + \frac{mg\Delta(s)}{kT}, \tag{32} \]

$Z = \sum_s \exp\{-mg\Delta(s)/kT\}$ being the partition function, the l.h.s. of (31) gives the 
average work per cycle (in units of $kT$), and the r.h.s. is, of course, the entropy difference. 
Note that we have assumed nothing about the structure of the underlying Markov 
process except detailed balance. This discussion easily extends to the case of correlated 
input symbols, as in Section 4. 
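The inequality (31) can be illustrated numerically for a small alphabet. The sketch below uses a hypothetical three-state process whose rates $\exp\{-[E(s') - E(s)]/2kT\}$ are one possible choice satisfying detailed balance (an assumption made here, not a structure imposed by the text), and integrates the master equation by small Euler steps. Energies are given in units of $kT$; all names and numerical values are illustrative.

```python
import math

def evolve(p0, energies, tau, steps=5000):
    """Euler integration of the master equation over an interval of length tau."""
    K = len(energies)
    # Assumed rates a -> b; they satisfy detailed balance w.r.t. exp(-E(s)).
    rate = [[0.0 if a == b else math.exp(-(energies[b] - energies[a]) / 2.0)
             for b in range(K)] for a in range(K)]
    dt = tau / steps
    p = list(p0)
    for _ in range(steps):
        flow = [sum(p[a] * rate[a][b] - p[b] * rate[b][a] for a in range(K))
                for b in range(K)]
        p = [pb + dt * fb for pb, fb in zip(p, flow)]
    return p

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)

def work_and_entropy_gain(p0, energies, tau):
    """Return (average work per cycle in units of kT, H(P_tau) - H(P_0))."""
    p_tau = evolve(p0, energies, tau)
    # l.h.s. of (31); the ln Z term of (32) cancels since both vectors sum to one.
    work = sum((b - a) * e for a, b, e in zip(p0, p_tau, energies))
    return work, entropy(p_tau) - entropy(p0)

work, gain = work_and_entropy_gain([0.8, 0.15, 0.05], [0.0, 1.0, 2.5], tau=1.0)
```

Each Euler step is itself a stochastic map that leaves $P_{\mathrm{eq}}$ invariant (for a sufficiently small step), so the monotonicity of the divergence, and hence (31), is preserved by the discretization.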


7. Individual Sequences and the LZ Complexity 

Finally, we extend the scope to the case where $x_1, x_2, \ldots$ is an individual sequence, 
namely, an arbitrary deterministic sequence, with no assumptions concerning the 
mechanism that has generated it. The outgoing sequence is, of course, still random 
due to the randomness of the state transitions. In this setting, the LZ complexity of 
the incoming sequence will play a pivotal role, and therefore, before moving on to the 
derivation for the individual-sequence setting, we pause to provide a brief background 
concerning the LZ complexity, which can be thought of as an individual-sequence 
counterpart of entropy. 

In 1978, Ziv and Lempel [23] invented their famous universal algorithm for 
data compression, which has been considered a major breakthrough, both from the 
theoretical aspects and the practical aspects of data compression. For a given 
(individual) infinite sequence, $x_1, x_2, \ldots$, the LZ algorithm achieves a compression 
ratio, which is asymptotically as good as that of the best data compression algorithm 
that is implementable by a finite-state machine. To the first order, the compression 
ratio achieved by the LZ algorithm, upon compressing the first $n$ symbols, $x^n = 
(x_1, x_2, \ldots, x_n)$, i.e., the LZ complexity of $x^n$, is about 

\[ \rho(x^n) = \frac{c(x^n) \log c(x^n)}{n}, \tag{33} \]

where $c(x^n)$ is the number of distinct phrases of $x^n$ obtained upon applying the so-called 
incremental parsing procedure. The incremental parsing procedure works as 
follows. The sequence $x_1, x_2, \ldots, x_n$ is parsed sequentially (from left to right), where 

¶ This monotonicity property is, in fact, an extended version of the H-Theorem to the case where the 
equilibrium distribution is not necessarily uniform. Informally speaking, while the H-Theorem is about 
the increase in (Shannon) entropy in an isolated system, this monotonicity property of the divergence 
symbolizes the decrease in free energy more generally. 
each parsed phrase is the shortest string that has not been encountered before as a 
parsed phrase, except perhaps the last phrase, which might be incomplete. For example, 
the sequence $x^{17} = 10001101110100010$ is parsed as 1,0,00,11,01,110,10,001,0. The 
first two phrases are obviously ‘1’ and ‘0’ as there is no ‘history’ yet. The next ‘0’ has 
already been seen as a phrase, but the string ‘00’ has not yet been seen, so the next 
phrase is ‘00’. Proceeding to the next bit, ‘1’ has already appeared as a phrase, but ‘11’ 
has not, and so on. In this example then, $c(x^{17}) = 9$. The idea of the LZ algorithm 
is to sequentially compress the sequence phrase-by-phrase, where each phrase, say, 
of length $r$, is represented by a pointer to the location of the appearance of the first 
$r - 1$ symbols as a previous phrase (already decoded by the de-compressor), plus an 
uncompressed representation of the $r$-th symbol of that phrase. It is shown in [23] that 
if the LZ algorithm is applied to a random vector $X^n$ that is sampled from a stationary 
and ergodic process, then $\rho(X^n)$ converges with probability one to the entropy rate of 
the process, $H = \lim_{n\to\infty} H(X^n)/n$. In that sense, $\rho(x^n)$ can be thought of as an 
analogue of entropy in the individual-sequence setting. 
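The incremental parsing procedure described above can be sketched in a few lines of Python. The function names are illustrative, and the use of the natural logarithm in the complexity (so that the result is in nats per symbol) is an assumption made here for consistency with the entropies defined in Section 2.

```python
import math

def incremental_parse(bits):
    """Split a string of symbols into phrases, each the shortest string not seen before."""
    phrases, seen, current = [], set(), ""
    for b in bits:
        current += b
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        phrases.append(current)   # last phrase may be incomplete (already seen)
    return phrases

def lz_complexity(bits):
    """c(x^n) * log c(x^n) / n, as in eq. (33), with natural logarithms (assumed)."""
    c = len(incremental_parse(bits))
    return c * math.log(c) / len(bits)

phrases = incremental_parse("10001101110100010")
# phrases == ['1', '0', '00', '11', '01', '110', '10', '001', '0']
```

Applied to the example sequence of the text, the parser returns the nine phrases listed there, the last of which is the incomplete phrase ‘0’.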

The general idea, in this section, is that, in the context of the entropic upper bound 
on the extracted work, the role of the input entropy, $H(X^n)$, of the probabilistic case, 
will now be played by $\rho(x^n)$, whereas $H(Y^n)$ will be upper bounded in terms of $\rho(x^n)$. 
Thus, the concept of LZ complexity is not only analogous to information-theoretic 
entropy, but in a way, it also plays an entropic role in the physical sense. 

Equipped with this background, we now move on to the derivation. For simplicity,
we consider the binary case, but everything can be extended to the non-binary case,
following the considerations of Section 6. Consider then an individual binary sequence
$(x_1, x_2, \ldots, x_n)$ of incoming bits. Let $\ell$ be a divisor of $n$ and chop the sequence into $n/\ell$
non-overlapping blocks of length $\ell$, $\mathbf{x}_i = (x_{i\ell+1}, x_{i\ell+2}, \ldots, x_{i\ell+\ell})$, $i = 0, 1, \ldots, n/\ell - 1$.

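Consider now the empirical distribution of $\ell$-blocks

$$P(x^\ell) = \frac{\ell}{n}\sum_{i=0}^{n/\ell-1} 1\{\mathbf{x}_i = x^\ell\}, \qquad x^\ell \in \{0,1\}^\ell. \tag{34}$$

Now, define

$$P[X_i = 1] = \sum P(x^\ell), \qquad i = 1, 2, \ldots, \ell, \tag{35}$$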

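where the summation is over all binary $\ell$-vectors $\{x^\ell\}$ whose $i$-th coordinate is 1. The
average work for a given $(x_1, x_2, \ldots, x_n)$ is given by

$$\begin{aligned}
\langle W_n\rangle &= kTf\sum_{t=1}^{n}\left(\langle Y_t\rangle - x_t\right)\\
&= kTf\cdot\frac{n}{\ell}\sum_{i=1}^{\ell}\left(P[Y_i = 1] - P[X_i = 1]\right)\\
&\le kT\cdot\frac{n}{\ell}\left[H(Y^\ell) - H(X^\ell)\right],
\end{aligned} \tag{36}$$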

where $P[Y_i = 1] = P[X_i = 1]Q_{1\to 1} + P[X_i = 0]Q_{0\to 1}$, $H(X^\ell)$ is the empirical entropy
of $\ell$-blocks associated with $x^n = (x_1, x_2, \ldots, x_n)$, and $H(Y^\ell)$ is the output entropy of
$\ell$-vectors that is induced by the input assignment $\{P(x^\ell)\}$ and $\ell$ uses of the memoryless
channel $Q$. The last inequality is simply an application of the results of Section 4 to
the case where the joint distribution of $X^\ell$ is $P(\cdot)$. This already gives some meaning to
the notion of entropy production in this case, where the incoming bits are deterministic.
However, the choice of the parameter $\ell$ (among the divisors of $n$) appears to be somewhat
arbitrary. In the following, we obtain another bound, which is asymptotically
independent of $\ell$. In this bound, $H(X^\ell)$ will be replaced by $\rho(x^n)$. From [T^ eq. (21)],
we have the following lower bound on $H(X^\ell)$ in terms of the LZ complexity of $x^n$ (setting the
alphabet size to 2 and passing logarithms to base $e$):

$$\frac{H(X^\ell)}{\ell} \ge \rho(x^n) - \frac{8\ell\ln 2}{(1-\epsilon_n)\log n} - \frac{2\ell 2^{2\ell}\ln 2}{n} - \frac{\ln 2}{\ell} = \rho(x^n) - \delta(n,\ell), \tag{37}$$

where $\epsilon_n \to 0$. This inequality is a result of comparing the compression ratio of a certain
block code to a lower bound on the compression performance of a general finite-state
machine, which is essentially $\rho(x^n)$. Of course, $\lim_{\ell\to\infty}\lim_{n\to\infty}\delta(n,\ell) = 0$. Let $\delta_n$ denote
the minimum of $\delta(n,\ell)$ over all $\{\ell\}$ that are divisors of $n$.
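As an illustration of the quantities entering (34)-(36), the following Python sketch computes the empirical distribution of $\ell$-blocks, a single-coordinate marginal, and the block entropies before and after the channel. The input sequence, the channel matrix `Q` (with `Q[a][b]` taken to be the probability that input bit `a` yields output bit `b`), and all function names are hypothetical choices made for this example, and entropies are in nats.

```python
from collections import Counter
from itertools import product
from math import log


def block_distribution(x, ell):
    """Empirical distribution of non-overlapping ell-blocks, as in (34)."""
    assert len(x) % ell == 0, "ell must divide n"
    blocks = [tuple(x[i:i + ell]) for i in range(0, len(x), ell)]
    counts = Counter(blocks)
    return {b: c / len(blocks) for b, c in counts.items()}


def marginal_one(P, i):
    """P[X_i = 1] as in (35); i is 0-based here."""
    return sum(p for b, p in P.items() if b[i] == 1)


def entropy(P):
    """Shannon entropy in nats."""
    return -sum(p * log(p) for p in P.values() if p > 0)


def output_distribution(P, Q, ell):
    """Law of Y^ell when X^ell ~ P passes through ell uses of the channel Q."""
    out = {}
    for y in product((0, 1), repeat=ell):
        total = 0.0
        for xb, p in P.items():
            w = p
            for a, b in zip(xb, y):
                w *= Q[a][b]
            total += w
        out[y] = total
    return out


x = [1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1]  # hypothetical input, n = 16
Q = [[0.9, 0.1], [0.1, 0.9]]                          # hypothetical symmetric channel
ell = 2
P = block_distribution(x, ell)
PY = output_distribution(P, Q, ell)
print(marginal_one(P, 0), entropy(P) / ell, entropy(PY) / ell)
```

For this symmetric channel the output block entropy cannot be smaller than the input block entropy, so the bracketed difference in (36) is non-negative.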

It remains to deal with the entropy of $Y^\ell$. First, observe that the case of very large
$\tau$ is obvious, because in this case, $H(Y^n) = nH(P_{\mathrm{eq}})$, as $\{Y_i\}$ is i.i.d. with marginal $P_{\mathrm{eq}}$,
regardless of $x^n$. Therefore, neglecting the term $\delta_n$, the upper bound on the extracted
work becomes

$$\langle W_n\rangle \le kTn\left[H(P_{\mathrm{eq}}) - \rho(x^n)\right]. \tag{38}$$

For a general $\tau$, we proceed as follows. Given the binary-input, binary-output DMC
$Q: \mathcal{X} \to \mathcal{Y}$, define the single-letter function

$$U(z) = \max\{H(Y) : H(X) \ge z\}. \tag{39}$$


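The function $U(z)$ is concave and monotonically decreasing. The monotonicity is
obvious. As for the concavity, indeed, let $P_0$ and $P_1$ be the achievers of $U(z_0)$ and
$U(z_1)$, respectively. Then, for $0 \le \lambda \le 1$, by the concavity of the entropy, the entropy of
$P_\lambda = (1-\lambda)P_0 + \lambda P_1$ is never less than $(1-\lambda)z_0 + \lambda z_1$, and so,

$$\begin{aligned}
U[(1-\lambda)z_0 + \lambda z_1] &\ge H(Y_\lambda) \qquad [Y_\lambda \text{ being induced by } P_\lambda \text{ and } Q]\\
&\ge (1-\lambda)H(Y_0) + \lambda H(Y_1)\\
&= (1-\lambda)U(z_0) + \lambda U(z_1).
\end{aligned} \tag{40}$$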
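The monotonicity and concavity of $U$ can also be checked numerically. The sketch below approximates the maximization in (39) by a grid search over the input probability $p$; the channel parameters `e0` and `e1` (probabilities that inputs 0 and 1 produce output 0), the grid size, and the tolerance are hypothetical choices, and entropies are in nats so that $z$ ranges over $[0, \ln 2]$. It is a numerical sanity check, not a proof.

```python
from math import log


def h(p):
    """Binary entropy in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log(p) - (1 - p) * log(1 - p)


def make_U(e0, e1, steps=20000):
    """Return a grid-search approximation of U(z) = max{H(Y): H(X) >= z}.

    The input law is (p, 1 - p) on {0, 1}; e0 and e1 are the probabilities
    that inputs 0 and 1, respectively, produce output 0.
    """
    table = []
    for k in range(steps + 1):
        p = k / steps
        table.append((h(p), h(p * e0 + (1 - p) * e1)))

    def U(z):
        # Small slack so that z = ln 2 still admits p = 1/2 despite rounding.
        return max(hy for hx, hy in table if hx >= z - 1e-12)

    return U


e0, e1 = 0.8, 0.4                      # hypothetical channel parameters
U = make_U(e0, e1)
zs = [i * log(2) / 50 for i in range(51)]
us = [U(z) for z in zs]

is_nonincreasing = all(a >= b for a, b in zip(us, us[1:]))
# Discrete concavity: each value dominates the average of its two neighbours.
is_concave = all(us[i] >= (us[i - 1] + us[i + 1]) / 2 - 1e-5 for i in range(1, 50))
print(is_nonincreasing, is_concave)
```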

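Note that if $H(X^\ell) \ge \ell z$, then a fortiori, $\sum_{i=1}^{\ell} H(X_i) \ge \ell z$, and so, for the given DMC,

$$\begin{aligned}
H(Y^\ell) &\le \sum_{i=1}^{\ell} H(Y_i)\\
&\le \sum_{i=1}^{\ell} U[H(X_i)]\\
&\le \ell\cdot U\left(\frac{1}{\ell}\sum_{i=1}^{\ell} H(X_i)\right)\\
&\le \ell\cdot U(z).
\end{aligned} \tag{41}$$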
Applying this to the input distribution $\{P(x^\ell)\}$ and the channel $Q$, we have, by (37):

$$H(Y^\ell) \le \ell\cdot U\left[\frac{H(X^\ell)}{\ell}\right] \le \ell\cdot U\left[\rho(x^n) - \delta(n,\ell)\right], \tag{42}$$


and so, defining the function

$$V(z) = U(z) - z, \tag{43}$$

which is concave and decreasing as well, we get the following upper bound on $\langle W_n\rangle$ in
terms of the LZ complexity of $x^n$:

$$\langle W_n\rangle \le kTn\cdot V\left[\rho(x^n) - \delta_n\right]. \tag{44}$$


It tells us, among other things, that the more $x^n$ is LZ-compressible, the more work
extraction one can hope for.

This upper bound is tight in the sense that no other bound that depends on $x^n$
only via its LZ compressibility $\rho(x^n)$ can be tighter, because for a given value $\rho$ (in the
range where the constraint in the maximization defining $U(\rho)$ is attained with equality)
of the LZ compressibility, $\rho(x^n)$, there exist sequences with LZ compressibility $\rho$ for
which the bound $kTnV(\rho)$ is essentially attained. This is the case, for example, for
most typical sequences of the memoryless source $P^*$ that achieves $U(\rho)$. Tighter bounds
can be obtained, of course, if more detailed information is given about the empirical
statistics of $x^n$.

The important point about the function $U$ (and, of course, $V$) is that, in the
jargon of information theorists, it is a single-letter function, that is, its calculation
requires merely optimization at the level of marginal distributions of a single symbol,
and not distributions associated with vectors. In Appendix B, we provide an explicit
expression of $U(z)$.
Appendix A

Proof of Eq. 1^) . 

The proof is by induction: For $n = 1$, this is trivially true. Assume that it is true for
a given $n$. Then, by the memorylessness of the channel $Q$, $Y^n \to X^n \to X_{n+1} \to Y_{n+1}$
is a Markov chain, and so, by the data processing theorem [H Section 2.8],

$$\begin{aligned}
H(X^n) + H(X_{n+1}) - H(X^{n+1}) &= I(X_{n+1}; X^n)\\
&\ge I(Y_{n+1}; Y^n)\\
&= H(Y^n) + H(Y_{n+1}) - H(Y^{n+1}),
\end{aligned} \tag{45}$$


which is equivalent to

$$H(Y^{n+1}) - H(X^{n+1}) \ge \left[H(Y^n) - H(X^n)\right] + \left[H(Y_{n+1}) - H(X_{n+1})\right], \tag{46}$$
and so,

$$H(Y^n) - H(X^n) \ge \sum_{i=1}^{n}\left[H(Y_i) - H(X_i)\right] \tag{47}$$

implies

$$H(Y^{n+1}) - H(X^{n+1}) \ge \sum_{i=1}^{n+1}\left[H(Y_i) - H(X_i)\right], \tag{48}$$


completing the proof. 
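The inequality established in this appendix can be spot-checked numerically. The sketch below draws random joint distributions on three input bits, passes them through a memoryless binary channel, and verifies that the block entropy difference dominates the sum of single-letter differences. The channel `Q` (with `Q[a][b]` the probability that input `a` yields output `b`), the block length, and the function names are hypothetical choices for this example; it is a sanity check under these assumptions, not a substitute for the proof.

```python
import random
from itertools import product
from math import log, prod


def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(p * log(p) for p in dist.values() if p > 0)


def push_through(P, Q):
    """Law of Y^n when X^n ~ P and each bit passes independently through Q."""
    n = len(next(iter(P)))
    return {
        y: sum(p * prod(Q[a][b] for a, b in zip(x, y)) for x, p in P.items())
        for y in product((0, 1), repeat=n)
    }


def marginal(P, i):
    """Marginal law of the i-th coordinate (0-based)."""
    m = {0: 0.0, 1: 0.0}
    for x, p in P.items():
        m[x[i]] += p
    return m


def gap(P, Q):
    """[H(Y^n) - H(X^n)] minus the sum over i of [H(Y_i) - H(X_i)]."""
    PY = push_through(P, Q)
    n = len(next(iter(P)))
    single = sum(entropy(marginal(PY, i)) - entropy(marginal(P, i)) for i in range(n))
    return (entropy(PY) - entropy(P)) - single


random.seed(0)
Q = [[0.7, 0.3], [0.2, 0.8]]            # hypothetical channel
worst = float("inf")
for _ in range(200):
    w = [random.random() for _ in range(8)]
    total = sum(w)
    P = {x: wi / total for x, wi in zip(product((0, 1), repeat=3), w)}
    worst = min(worst, gap(P, Q))
print(worst >= -1e-12)
```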


Appendix B 


Deriving an Explicit Expression for $U(z)$.

For the case of the binary-input, binary-output channel at hand, let us denote
$e_0 = Q_{0\to 0}$ and $e_1 = Q_{1\to 0}$, and assume that $e_0 \ge e_1$ (otherwise, switch the roles of
the inputs). If the input assignment is $(p, 1-p)$, then the output entropy is clearly
$h(pe_0 + \bar{p}e_1)$ ($\bar{p}$ being $1-p$). The constraint $h(p) \ge z$ is equivalent to the constraint
$h^{-1}(z) \le p \le 1 - h^{-1}(z)$, where $h^{-1}(z)$ is the smaller of the two solutions $\{u\}$ to the
equation $h(u) = z$. Denoting

$$\alpha_z = e_0 h^{-1}(z) + e_1\left[1 - h^{-1}(z)\right] \tag{49}$$

$$\beta_z = e_0\left[1 - h^{-1}(z)\right] + e_1 h^{-1}(z) \tag{50}$$

then $e_0 \ge e_1$ implies $\beta_z \ge \alpha_z$, and then

$$U(z) = \max\{h(q) : \alpha_z \le q \le \beta_z\} = \begin{cases} h(\beta_z) & \beta_z < \frac{1}{2}\\ \ln 2 & \alpha_z \le \frac{1}{2} \le \beta_z\\ h(\alpha_z) & \alpha_z > \frac{1}{2} \end{cases} \tag{51}$$

The condition $\beta_z < 1/2$ is always satisfied if $e_0 < 1/2$. For $e_0 > 1/2 > e_1$, this condition
is equivalent to

$$z > h\left(\frac{e_0 - 1/2}{e_0 - e_1}\right) \equiv z_*. \tag{52}$$

Similarly, the condition $\alpha_z > 1/2$ is always satisfied if $e_1 > 1/2$. For $e_1 < 1/2 < e_0$, this
condition is equivalent to

$$z > h\left(\frac{1/2 - e_1}{e_0 - e_1}\right) \equiv z^*.$$

Thus, to summarize, $U(z)$ behaves as follows:

(i) For $e_1 > 1/2$, $U(z) = h(\alpha_z)$ for all $z \in [0,1]$.

(ii) For $e_0 < 1/2$, $U(z) = h(\beta_z)$ for all $z \in [0,1]$.

(iii) For $e_1 < 1/2 < e_0$ and $e_0 + e_1 > 1$,

$$U(z) = \begin{cases} \ln 2 & 0 \le z \le z^*\\ h(\alpha_z) & z^* < z \le 1 \end{cases} \tag{53}$$

(iv) For $e_1 < 1/2 < e_0$ and $e_0 + e_1 < 1$,

$$U(z) = \begin{cases} \ln 2 & 0 \le z \le z_*\\ h(\beta_z) & z_* < z \le 1 \end{cases}$$

Note that for the binary symmetric channel ($e_0 + e_1 = 1$), trivially $U(z) = \ln 2$ for all
$z \in [0,1]$. Also, in all cases $U(1) = h[(e_0 + e_1)/2]$.
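The closed form (51) is straightforward to evaluate numerically once $h^{-1}$ is available. The following Python sketch computes $h^{-1}$ by bisection and then applies (49)-(51). It assumes entropies in nats (consistent with the value $\ln 2$ in (51)), so $z$ is taken in $[0, \ln 2]$ and the largest admissible $z$ is $\ln 2$; the function names and the example channel parameters are hypothetical.

```python
from math import log

LN2 = log(2)


def h(p):
    """Binary entropy in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log(p) - (1 - p) * log(1 - p)


def h_inv(z, iters=60):
    """Smaller solution u in [0, 1/2] of h(u) = z, found by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(iters):
        mid = (lo + hi) / 2
        if h(mid) < z:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def U(z, e0, e1):
    """Closed form (51); assumes e0 >= e1 and 0 <= z <= ln 2."""
    u = h_inv(z)
    alpha = e0 * u + e1 * (1 - u)      # (49)
    beta = e0 * (1 - u) + e1 * u       # (50)
    if beta < 0.5:
        return h(beta)
    if alpha > 0.5:
        return h(alpha)
    return LN2


def V(z, e0, e1):
    """V(z) = U(z) - z, as in (43)."""
    return U(z, e0, e1) - z


print(U(0.3, 0.8, 0.4), U(LN2, 0.8, 0.4), V(0.3, 0.8, 0.4))
```

With $e_0 = 0.8$ and $e_1 = 0.4$ (so $e_0 + e_1 > 1$), small values of $z$ fall in the flat region where $U(z) = \ln 2$, while at the largest admissible $z$ the value is $h[(e_0 + e_1)/2] = h(0.6)$.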